ABSTRACT

Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to handle large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on a novel idea of random path, and an extended method is also presented to enhance the structural similarity when two vertices are completely disconnected. We provide theoretical proofs for the error-bound and confidence of the proposed algorithm.

We perform an extensive empirical study and show that our algorithm can obtain top-k similar vertices for any vertex in a network approximately 300× faster than state-of-the-art methods. We also use identity resolution and structural hole spanner finding, two important applications in social networks, to evaluate the accuracy of the estimated similarities. Our experimental results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.

Categories and Subject Descriptors 

H.3.3 [Information Search and Retrieval]: Text Mining; J.4 [Social Behavioral Sciences]: Miscellaneous; H.4.m [Information Systems]: Miscellaneous

General Terms 

Algorithms, Experimentation 

Keywords 

Vertex similarity; Social network; Random path 

1. INTRODUCTION

Estimating vertex similarity is a fundamental issue in network analysis and also the cornerstone of many data mining algorithms such as clustering, graph matching, and object retrieval. The problem is also referred to as structural equivalence in previous work, and has been extensively studied in physics, mathematics, and computer science. In general, there are two basic principles to quantify similarity between vertices. The first principle is that two vertices are considered structurally equivalent if they have many common neighbors in a network. The second principle is that two vertices are considered structurally equivalent if they play the same structural role; this can be further quantified by degree, closeness centrality, betweenness, and other network centrality metrics. Quite a few similarity metrics have been developed based on the first principle, e.g., the Jaccard index and Cosine similarity. However, they estimate the similarity in a local fashion. Though some work such as SimRank [18], VertexSim [24], and RoleSim [19] uses the entire network to compute similarity, these methods are essentially based on the transitivity of similarity in the network. There are also a few studies that follow the second principle. For example, Henderson et al. [14] proposed a feature-based method, named ReFeX, to calculate vertex similarity by defining a vector of features for each vertex.

Despite much research on this topic, the problem remains largely unsolved. The first challenge is how to design a unified method to accommodate both principles. This is important, as in many applications, we do not know which principle to follow. The other challenge is the efficiency issue. Most existing methods have a high computation cost. SimRank results in a complexity of $O(I|V|^2 d^2)$, where $|V|$ is the number of vertices in a network, $d$ is the average degree of all vertices, and $I$ is the number of iterations needed to perform the SimRank algorithm. It is clearly infeasible to apply SimRank to large-scale networks. For example, in our experiments, when dealing with a network with 500,000 edges, even the fast (top-k) version of SimRank requires more than five days to complete the computation for all vertices (as shown in Figure 1(b)).

Thus, our goal in this work is to design a similarity method that is flexible enough to incorporate different structural patterns (features) into the similarity estimation and to quickly estimate vertex similarity in very large networks.

We propose a sampling-based method, referred to as Panther, that provably and quickly estimates the similarity between vertices. The algorithm is based on a novel idea of random path. Specifically, given a network, we perform $R$ random walks, each starting from a randomly picked vertex and walking $T$ steps. The idea behind this is that two vertices have a high similarity if they frequently appear on the same paths. We provide theoretical proofs for the error-bound and confidence of the proposed algorithm. Theoretically, we obtain that the sample size, $R = \frac{c}{\varepsilon^2}\left(\log_2 \binom{T}{2} + 1 + \ln\frac{1}{\delta}\right)$, only depends on the path length $T$ of each random walk, for a given error-bound $\varepsilon$ and confidence level $1 - \delta$. To capture the information of structural patterns, we extend the proposed algorithm by augmenting each vertex with a vector of structure-based features. The resultant algorithm is referred to as Panther++. Panther++ is not only able to estimate similarity between vertices in a connected network, but also capable of estimating similarity between vertices from disconnected networks. Figure 1(a) shows an example of top-k similarity search across two disconnected networks, where v4, vq and v5 are top-3 similar vertices to vq.

[Figure 1 panels: (a) Top-k similarity search; (b) Efficiency Performance; (c) Application: Identity Resolution]

Figure 1: Example of top-k similarity search across networks and performance comparison. (a) Top-k similarity search across two disconnected networks; (b) Efficiency comparison of Panther and several comparison methods on a Tencent subnetwork of 443,070 vertices and 5,000,000 edges; and (c) Accuracy performance when applying Panther++ to identity resolution, an important application in social networks. Please refer to §4 for definitions of all the comparison methods in (b) and (c).

We evaluate the efficiency of the methods on a microblogging network from Tencent.^1 Figure 1(b) shows the efficiency comparison of Panther, Panther++, and several other methods. Clearly, our methods are much faster than the comparison methods.

Panther++ achieves a 300× speed-up over the fastest comparison method on a Tencent subnetwork of 443,070 vertices and 5,000,000 edges. Our methods are also scalable. Panther is able to return top-k similar vertices for all vertices in a network with 51,640,620 vertices and 1,000,000,000 edges. On average, it needs only 0.0001 seconds to perform top-k search for each vertex.

We also evaluate the estimation capability of Panther++. Specifically, we use identity resolution and top-k structural hole spanner finding, two important applications in social networks, to evaluate the accuracy of the estimated similarities. Figure 1(c) shows the accuracy performance of Panther++ and several alternative methods for identity resolution. Panther++ achieves clearly better performance than several alternative methods. All code and datasets used in this paper are publicly available.^2


Organization. Section 2 formulates the problem. In Section 3, we detail the proposed methods for top-k similarity search and provide theoretical analysis. Section 4 presents experimental results to validate the efficiency and effectiveness of our methods. Section 5 reviews the related work. Finally, Section 6 concludes the paper.


2. PROBLEM FORMULATION

We first provide necessary definitions and then formally formulate the problem.

Definition 1. Undirected Weighted Network. Let $G = (V, E, W)$ denote a network, where $V$ is a set of $|V|$ vertices and $E \subset V \times V$ is a set of $|E|$ edges between vertices. We use $v_i$ to represent a vertex and $e_{ij} \in E$ to represent an edge between vertices $v_i$ and $v_j$. Let $W$ be a weight matrix, with each element $w_{ij} \in W$ representing the weight associated with edge $e_{ij}$.

We use $\mathcal{N}(v_i)$ to indicate the set of neighboring vertices of vertex $v_i$. We leave the study of directed networks to future work. Our purpose here is to estimate similarity between two vertices, e.g., $v_i$ and $v_j$. We focus on finding top-k similar vertices. Precisely, the problem can be defined as follows: given a network $G = (V, E, W)$ and a query vertex $v \in V$, how to find a set $X_{v,k}$ of $k$ vertices that have the highest similarities to vertex $v$, where $k$ is a positive integer.

A straightforward method to address the top-k similarity search problem is to first calculate the similarity $s(v_i, v_j)$ between vertices $v_i$ and $v_j$ using metrics such as the Jaccard index and SimRank, and then select a set $X_{v,k}$ of $k$ vertices that have the highest similarities to each vertex $v$. However, it is in general difficult to scale this approach up to large networks. One important idea is to obtain an approximate set $X^*_{v,k}$ for each vertex. From the accuracy perspective, we aim to minimize the difference between $X^*_{v,k}$ and $X_{v,k}$. Formally, we can define the problem studied in this work as follows.

Problem 1. Top-k similarity search. Given an undirected weighted network $G = (V, E, W)$, a similarity metric $s(\cdot)$, a positive integer $k$, and any vertex $v \in V$, how to quickly and approximately retrieve the top-k similar vertices of $v$? How to guarantee that the difference between the two sets $X^*_{v,k}$ and $X_{v,k}$ is less than a threshold $\varepsilon \in (0, 1)$, i.e.,

$$\mathrm{Diff}(X^*_{v,k}, X_{v,k}) < \varepsilon$$

with a probability of at least $1 - \delta$?

The difference between $X^*_{v,k}$ and $X_{v,k}$ can also be viewed as the error-bound of the approximation. In the following section, we will propose a sampling-based method to approximate the top-k vertex similarity. We will explain in detail how the method can guarantee the error-bound and how it is able to efficiently achieve the goal.

^1 http://t.qq.com

^2 https://github.com/yujing5b5d/rdsextr


3. PANTHER: FAST TOP-K SIMILARITY 
SEARCH USING PATH SAMPLING 


We begin by considering some baseline solutions and then propose our path sampling approach. A simple approach to the problem is to consider the number of common neighbors of $v_i$ and $v_j$. If we use the Jaccard index, the similarity can be defined as

$$S_{JA}(v_i, v_j) = \frac{|\mathcal{N}(v_i) \cap \mathcal{N}(v_j)|}{|\mathcal{N}(v_i) \cup \mathcal{N}(v_j)|}.$$

This method only considers local information and does not allow vertices to be similar if they do not share neighbors.
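As a minimal sketch of this baseline, the Jaccard index can be computed directly from neighbor sets. The dictionary-of-sets graph representation, the function name `jaccard_similarity`, and the toy graph are illustrative assumptions, not part of the original work.

```python
def jaccard_similarity(neighbors, vi, vj):
    """Jaccard index of the neighbor sets of vi and vj.

    `neighbors` maps each vertex to the set of its adjacent vertices
    (an assumed representation chosen for this sketch).
    """
    a, b = neighbors[vi], neighbors[vj]
    union = a | b
    if not union:
        # Two isolated vertices: define the similarity as 0 (assumption).
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy graph with two components: {a, b, c, d} and {e, f}.
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": {"f"},
    "f": {"e"},
}

sim_ab = jaccard_similarity(neighbors, "a", "b")  # one shared neighbor out of three
sim_ae = jaccard_similarity(neighbors, "a", "e")  # no shared neighbors
```

The second call illustrates the limitation noted above: vertices in different components never share neighbors, so their Jaccard similarity is always zero.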

To leverage the structural information, one can consider algorithms like SimRank [18]. SimRank estimates vertex similarity by iteratively propagating vertex similarity to neighbors until convergence (no vertex similarity changes), i.e.,

$$S_{SR}(v_i, v_j) = \frac{C}{|\mathcal{N}(v_i)||\mathcal{N}(v_j)|} \sum_{v_l \in \mathcal{N}(v_i)} \sum_{v_m \in \mathcal{N}(v_j)} S_{SR}(v_l, v_m),$$

where $C$ is a constant between 0 and 1.

SimRank similarity depends on the whole network and allows vertices to be similar without sharing neighbors. The problem with SimRank is its high computational complexity, $O(I|V|^2 d^2)$, which makes it infeasible to scale up to large networks. Though quite a few studies have been conducted recently, the problem is still largely unsolved.
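To make the cost of this iteration concrete, the following is a naive sketch of the SimRank update on an undirected graph. It assumes the usual convention that a vertex has similarity 1 with itself; the function name, the decay value 0.8, and the toy star graph are illustrative assumptions. Each iteration touches every vertex pair and every pair of their neighbors, which is where the $|V|^2 d^2$ factor comes from.

```python
def simrank(neighbors, C=0.8, iterations=5):
    """Naive SimRank on an undirected graph given as adjacency sets.

    Assumes s(v, v) = 1 for every vertex (standard SimRank convention).
    """
    nodes = list(neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                    continue
                nu, nv = neighbors[u], neighbors[v]
                if not nu or not nv:
                    new[(u, v)] = 0.0
                    continue
                # Sum similarities over all neighbor pairs: O(d^2) per vertex pair.
                total = sum(sim[(a, b)] for a in nu for b in nv)
                new[(u, v)] = C * total / (len(nu) * len(nv))
        sim = new
    return sim


# Hypothetical star graph: the two leaves share their only neighbor.
star = {"c": {"x", "y"}, "x": {"c"}, "y": {"c"}}
scores = simrank(star)
```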

We propose a sampling-based method to estimate the top-k similar vertices. In statistics, sampling is a widely used technique to estimate a target distribution. Unlike traditional sampling methods, we propose a random path sampling method, named Panther. Given a network $G = (V, E, W)$, Panther randomly generates $R$ paths of length $T$. Then the similarity estimation between two vertices is cast as estimating how likely it is that two vertices appear on the same path. Theoretically, we prove that given an error-bound $\varepsilon$ and a confidence level $1 - \delta$, the sample size $R$ is independent of the network size. Experimentally, we demonstrate that the error-bound is dependent on the number of edges of the network.


3.1 Random Path Sampling 

The basic idea of the method is that two vertices are similar if they frequently appear on the same paths. The principle is similar to that in Katz [20].


Path Similarity. To begin with, we introduce how to estimate vertex similarity based on T-paths. A T-path is defined as a sequence of vertices $p = (v_1, \cdots, v_{T+1})$, which consists of $T + 1$ vertices and $T$ edges.^3 Let $\Pi$ denote the set of all T-paths in $G$. Let $w(p)$ be the weight of a path $p$. The weight can be defined in different ways. Given this, the path similarity between $v_i$ and $v_j$ is defined as:

$$S_{RP}(v_i, v_j) = \frac{\sum_{p \in P_{v_i, v_j}} w(p)}{\sum_{p \in \Pi} w(p)}, \qquad (1)$$

where $P_{v_i, v_j}$ is the subset of $\Pi$ whose paths contain both $v_i$ and $v_j$.


Estimating Path Similarity with Random Sampling. To calculate the denominator in Eq. (1), we need to enumerate all T-paths in $G$. However, the time complexity is exponential in the path length $T$, and thus the computation becomes inefficient as $T$ increases. Therefore, we propose a sampling-based method to estimate the path similarity. The key idea is that we randomly sample $R$ paths from the network and recalculate Eq. (1) based on the sampled paths:

$$S_{RP}(v_i, v_j) = \frac{\sum_{p \in P_{v_i, v_j}} w(p)}{\sum_{p \in P} w(p)}, \qquad (2)$$

where $P$ is the set of sampled paths.

To generate a path, we randomly select a vertex $v$ in $G$ as the starting point, and then conduct a random walk of $T$ steps from $v$ using $t_{ij}$ as the transition probability from vertex $v_i$ to $v_j$:

$$t_{ij} = \frac{w_{ij}}{\sum_{v_k \in \mathcal{N}(v_i)} w_{ik}}, \qquad (3)$$

where $w_{ij}$ is the weight between $v_i$ and $v_j$. In an unweighted network, the transition probability can be simplified as $1/|\mathcal{N}(v_i)|$.
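The transition probabilities of Eq. (3) amount to normalizing each vertex's outgoing edge weights. A minimal sketch follows, assuming the weighted graph is stored as a nested dictionary `graph[vi][vj] = w_ij`; that representation, the function name, and the toy values are illustrative assumptions.

```python
def transition_probabilities(graph):
    """Row-normalize edge weights: t[vi][vj] = w_ij / sum_k w_ik (Eq. 3)."""
    t = {}
    for vi, nbrs in graph.items():
        total = sum(nbrs.values())
        # A vertex without neighbors simply gets an empty row.
        t[vi] = {vj: w / total for vj, w in nbrs.items()}
    return t


# Hypothetical weighted graph: edge a-b has weight 3, edge a-c has weight 1.
graph = {
    "a": {"b": 3.0, "c": 1.0},
    "b": {"a": 3.0},
    "c": {"a": 1.0},
}
t = transition_probabilities(graph)  # t["a"] == {"b": 0.75, "c": 0.25}
```

With all weights equal, each row reduces to the uniform $1/|\mathcal{N}(v_i)|$ case mentioned above.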

^3 Vertices in the same path do not need to be distinct.



Figure 2: Illustration of random path sampling. 


Based on random walk theory, we define $w(p)$ as

$$w(p) = \prod_{i=1,\, j=i+1}^{T} t_{ij}.$$

The path weight also represents the probability that a path $p$ is sampled from $\Pi$; thus, $w(p)$ in Eq. (2) is absorbed, and we can rewrite the equation as follows:

$$S_{RP}(v_i, v_j) = \frac{|P_{v_i, v_j}|}{R}. \qquad (4)$$

Algorithm 3 summarizes the process for generating the $R$ random paths. To calculate Eq. (4), the time complexity is $O(RT)$, because it has to enumerate all $R$ paths. To improve the efficiency, we build an inverted index of vertex-to-path. Using the index, we can retrieve all paths that contain a specific vertex $v$ with a complexity of $O(1)$. Then Eq. (4) can be calculated with a complexity of $O(\bar{R}T)$, where $\bar{R}$ is the average number of paths that contain a vertex and is proportional to the average degree $d$. Figure 2 illustrates the process of random path sampling. Details of the algorithm are presented in Algorithm 1.
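The sampling procedure and the inverted index can be sketched as follows. This is an illustration under stated assumptions rather than the authors' C++ implementation: the nested-dictionary graph, the function names, stopping early at a vertex without neighbors, and the fixed random seed are all choices made for this example.

```python
import random
from collections import defaultdict


def generate_random_paths(graph, R, T, rng):
    """Sample R random walks of T steps; also build the vertex-to-path index."""
    nodes = list(graph)
    paths = []
    index = defaultdict(set)  # vertex -> ids of the sampled paths containing it
    for r in range(R):
        v = rng.choice(nodes)  # uniformly random start vertex
        path = [v]
        index[v].add(r)
        for _ in range(T):
            nbrs = list(graph[v])
            if not nbrs:
                break  # assumption: stop early at a vertex with no neighbors
            # Choose the next vertex proportionally to edge weight (Eq. 3).
            v = rng.choices(nbrs, weights=[graph[v][u] for u in nbrs])[0]
            path.append(v)
            index[v].add(r)
        paths.append(path)
    return paths, index


def path_similarities(v, paths, index, R):
    """Estimate S_RP(v, u) = |P_{v,u}| / R for every u co-occurring with v (Eq. 4)."""
    scores = defaultdict(float)
    for r in index.get(v, ()):
        for u in set(paths[r]):  # count each co-occurring vertex once per path
            if u != v:
                scores[u] += 1.0 / R
    return dict(scores)


# Hypothetical toy graph with two disconnected edges: a-b and c-d.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0}, "c": {"d": 1.0}, "d": {"c": 1.0}}
R, T = 200, 3
paths, index = generate_random_paths(graph, R, T, random.Random(0))
scores_a = path_similarities("a", paths, index, R)
```

Because only the paths listed in `index[v]` are scanned, the cost of a query depends on how many sampled paths pass through `v`, not on the total number of paths. In this toy graph, `a` can only co-occur with `b`, which also shows why plain path similarity cannot relate vertices in different components.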

3.2 Theoretical Analysis 

We give theoretical analysis for the random path sampling algorithm. In general, the path similarity can be viewed as a probability measure defined over all paths $\Pi$. Thus we can adopt the results from Vapnik-Chervonenkis (VC) learning theory to analyze the proposed sampling-based algorithm. To begin with, we will introduce some basic definitions and fundamental results from Vapnik-Chervonenkis theory, and then demonstrate how to utilize these concepts and results to analyze our method.

Preliminaries. Let $(\mathcal{D}, \mathcal{R})$ be a range space, where $\mathcal{D}$ denotes a domain, and $\mathcal{R}$ is a range set on $\mathcal{D}$. For any set $B \subseteq \mathcal{D}$, $P_{\mathcal{R}}(B) = \{B \cap A : A \in \mathcal{R}\}$ is the projection of $\mathcal{R}$ on $B$. If $P_{\mathcal{R}}(B) = 2^B$, where $2^B$ is the powerset of $B$, we say that the set $B$ is shattered by $\mathcal{R}$. The following definitions and theorem derive from prior work on VC theory.

Definition 2. The Vapnik-Chervonenkis (VC) dimension of $\mathcal{R}$, denoted as $VC(\mathcal{R})$, is the maximum cardinality of a subset of $\mathcal{D}$ that can be shattered by $\mathcal{R}$.

Let $S = \{x_1, \cdots, x_n\}$ be a set of i.i.d. random variables sampled according to a distribution $\phi$ over the domain $\mathcal{D}$. For a set $A \subseteq \mathcal{D}$, let $\phi(A)$ be the probability that an element sampled from $\phi$ belongs to $A$, and let the empirical estimation of $\phi(A)$ on $S$ be

$$\phi_S(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_A(x_i),$$

where $\mathbb{1}_A$ is the indicator function, with $\mathbb{1}_A(x)$ equal to 1 if $x \in A$, and 0 otherwise.
















The question of interest is how well we can estimate $\phi(A)$ using its unbiased estimator, the empirical estimation $\phi_S(A)$. We first give the goodness of approximation in the following definition.

Definition 3. Let $\mathcal{R}$ be a range set on $\mathcal{D}$, and $\phi$ be a probability distribution defined on $\mathcal{D}$. For $\varepsilon \in (0, 1)$, an $\varepsilon$-approximation to $(\mathcal{R}, \phi)$ is a set $S$ of elements in $\mathcal{D}$ such that

$$\sup_{A \in \mathcal{R}} |\phi(A) - \phi_S(A)| \le \varepsilon.$$

One important result of VC theory is that if we can bound the VC-dimension of $\mathcal{R}$, it is possible to build an $\varepsilon$-approximation by randomly sampling points from the domain according to the distribution $\phi$. This is summarized in the following theorem.

Theorem 1. Let $\mathcal{R}$ be a range set on a domain $\mathcal{D}$, with $VC(\mathcal{R}) \le d$, and let $\phi$ be a distribution on $\mathcal{D}$. Given $\varepsilon, \delta \in (0, 1)$, let $S$ be a set of $|S|$ points sampled from $\mathcal{D}$ according to $\phi$, with

$$|S| = \frac{c}{\varepsilon^2}\left(d + \ln\frac{1}{\delta}\right),$$

where $c$ is a universal positive constant. Then $S$ is an $\varepsilon$-approximation to $(\mathcal{R}, \phi)$ with probability of at least $1 - \delta$.


Range Set of Path. In our setting, we set the domain to be $\Pi$, the set of all paths with length $T$ in the graph $G$. Accordingly, we define the range set $\mathcal{R}_G$ on domain $\Pi$ to be

$$\mathcal{R}_G = \{P_{v_i, v_j} : v_i, v_j \in V\}.$$

It is a valid range set, since it is a collection of subsets $P_{v_i, v_j}$ of the domain $\Pi$. We first show an upper bound of the VC dimension of $\mathcal{R}_G$ in Lemma 1. The proof is inspired by prior work.

Lemma 1. $VC(\mathcal{R}_G) \le \log_2 \binom{T}{2} + 1$.

Proof. We prove the lemma by contradiction. Assume $VC(\mathcal{R}_G) = l$ and $l > \log_2 \binom{T}{2} + 1$. By the definition of VC-dimension, there is a set $Q \subseteq \Pi$ of size $l$ that can be shattered by $\mathcal{R}_G$. That is, we have the following statement:

$$\forall S_i \subseteq Q, \ \exists P_i \in \mathcal{R}_G, \ \text{s.t.} \ P_i \cap Q = S_i,$$

where $P_i$ is the $i$-th range. Since each subset $S_i \subseteq Q$ is different from the other subsets, the corresponding range $P_i$ that makes $P_i \cap Q = S_i$ is also different from the other ranges. Moreover, the set $Q$ is shattered by $\mathcal{R}_G$ if and only if $\{P_i \cap Q : P_i \in \mathcal{R}_G\} = 2^Q$. Thus, $\forall p \in Q$, there are $2^{l-1}$ non-empty distinct subsets $S_1, \cdots, S_{2^{l-1}}$ of $Q$ containing the path $p$. So there are also $2^{l-1}$ distinct ranges in $\mathcal{R}_G$ that contain the path $p$, i.e.,

$$|\{P_i \mid p \in P_i \text{ and } P_i \in \mathcal{R}_G\}| = 2^{l-1}.$$

In addition, according to the definition of the range set, $\mathcal{R}_G = \{P_{v_i, v_j} : v_i, v_j \in V\}$, we know that a path belongs only to the ranges corresponding to pairs of vertices in path $p$, i.e., to the pairwise combinations of the vertices in $p$. This means the number of ranges in $\mathcal{R}_G$ that $p$ belongs to is equal to the combinatorial number $\binom{T}{2}$, i.e.,

$$|\{P_i \mid p \in P_i \text{ and } P_i \in \mathcal{R}_G\}| = \binom{T}{2}.$$

On the other hand, from our preliminary assumption, we have $l > \log_2 \binom{T}{2} + 1$, which is equivalent to $\binom{T}{2} < 2^{l-1}$. Thus,

$$|\{P_i \mid p \in P_i \text{ and } P_i \in \mathcal{R}_G\}| = \binom{T}{2} < 2^{l-1}.$$

Hence, we reach a contradiction: it is impossible to have $2^{l-1}$ distinct ranges $P_i \in \mathcal{R}_G$ containing $p$. Since there is a one-to-one correspondence between $S_i$ and $P_i$, we get that it is also impossible to have $2^{l-1}$ distinct subsets $S_i \subseteq Q$ containing $p$. Therefore, we prove that $Q$ cannot be shattered by $\mathcal{R}_G$ and $VC(\mathcal{R}_G) \le \log_2 \binom{T}{2} + 1$. □

Algorithm 1: Panther

Input: A network $G = (V, E, W)$, path length $T$, parameters $\varepsilon$, $c$, $\delta$, a vertex $v$, and $k$.
Output: top-k similar vertices with regard to $v$.

Calculate sample size $R = \frac{c}{\varepsilon^2}\left(\log_2 \binom{T}{2} + 1 + \ln\frac{1}{\delta}\right)$;
GenerateRandomPath($G$, $R$);
foreach $p_n \in P_v$ do
    foreach unique $v_j \in p_n$ do
        $S_{RP}(v, v_j) \mathrel{+}= \frac{1}{R}$;
Retrieve top-k similar vertices according to $S_{RP}(v, v_j)$;

Algorithm 2: Panther++

Input: A network $G = (V, E, W)$, path length $T$, parameters $\varepsilon$, $c$, $\delta$, vector dimension $D$, a vertex $v$, and $k$.
Output: top-k similar vertices with regard to $v$.

Calculate sample size $R = \frac{c}{\varepsilon^2}\left(\log_2 \binom{T}{2} + 1 + \ln\frac{1}{\delta}\right)$;
GenerateRandomPath($G$, $R$);
foreach $v_i \in V$ do
    foreach $p_n \in P_{v_i}$ do
        foreach unique $v_j \in p_n$ do
            $S_{RP}(v_i, v_j) \mathrel{+}= \frac{1}{R}$;
    Construct a vector $\theta(v_i)$ by taking the largest $D$ values from $\{S_{RP}(v_i, v_j) : v_j \in p_n \text{ and } p_n \in P_{v_i}\}$;
Build a kd-tree index based on the Euclidean distance between any vectors $\theta(v_i)$ and $\theta(v_j)$;
Query the top-k similar vertices from the index for $v$;

Sample Size Guarantee. We now provide a theoretical guarantee for the number of sampled paths. How many random paths do we need to achieve an error-bound $\varepsilon$ with probability $1 - \delta$? We define a probability distribution on the domain $\Pi$. $\forall p \in \Pi$, we define

$$\phi(p) = \frac{w(p)}{\sum_{p' \in \Pi} w(p')}.$$

We can see that the definition of $S_{RP}(v_i, v_j)$ in Eq. (1) is equivalent to $\phi(P_{v_i, v_j})$. This observation enables us to use a sampling-based method (empirical average) to estimate the original path similarity (true probability measure).

Plugging the result of Lemma 1 into Theorem 1, we obtain:

$$R = \frac{c}{\varepsilon^2}\left(\log_2 \binom{T}{2} + 1 + \ln\frac{1}{\delta}\right).$$

That is, with at least $R$ random paths, we can estimate the path similarity between any two vertices with the desired error-bound and confidence level. The above equation also implies that the sample size $R$ only depends on the path length $T$, given an error-bound $\varepsilon$ and a confidence level $1 - \delta$.
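The sample-size bound can be evaluated directly. The sketch below assumes the bound takes the form of Theorem 1 with the VC dimension replaced by the Lemma 1 bound, and rounds up to an integer; the function name and the value ε = 0.01 used in the example are hypothetical, chosen only to show the arithmetic.

```python
import math


def panther_sample_size(T, eps, delta, c=0.5):
    """Number of random paths: (c / eps^2) * (log2(C(T, 2)) + 1 + ln(1 / delta)).

    Requires T >= 2 so that the binomial coefficient C(T, 2) is at least 1.
    """
    vc_bound = math.log2(math.comb(T, 2)) + 1
    return math.ceil(c / eps ** 2 * (vc_bound + math.log(1.0 / delta)))


# Hypothetical setting: T = 5, delta = 0.1, c = 0.5, and an assumed eps of 0.01.
R = panther_sample_size(T=5, eps=0.01, delta=0.1)  # 33123 paths
```

Note that the function takes no argument describing the size of the network, which mirrors the observation that $R$ depends only on $T$, $\varepsilon$, and $\delta$. Halving $\varepsilon$ roughly quadruples the number of paths.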

Algorithm 3: GenerateRandomPath

Input: A network $G = (V, E, W)$ and sample size $R$.
Output: Paths $\{p_r\}_{r=1}^{R}$ and vertex-to-path index $\{P_v\}_{v \in V}$.

Calculate transition probabilities between every pair of vertices according to Eq. (3);
Initialize $r = 1$;
repeat
    Sample current vertex $v = v_i$ uniformly at random;
    Add $v$ into $p_r$ and add $p_r$ into the path set of $v$, i.e., $P_v$;
    repeat
        Randomly sample a neighbor $v_j$ according to transition probabilities from $v$ to its neighbors;
        Set current vertex $v = v_j$;
        Add $v$ into $p_r$ and add $p_r$ into $P_v$;
    until $|p_r| = T + 1$;
    $r \mathrel{+}= 1$;
until $r > R$;

3.3 Panther++

One limitation of Panther is that the similarities obtained by the algorithm have a bias to close neighbors, though in principle it considers the structural information. We therefore present an extension of the Panther algorithm. The idea is to augment each vertex with a feature vector. To construct the feature vector, we follow the intuition that the probability of a vertex linking to all other vertices is similar if their topology structures are similar. We select the top-D similarities calculated by Panther to represent the probability distribution. Specifically, for vertex $v_i$ in the network, we first calculate the similarity between $v_i$ and all the other vertices using Panther. Then we construct a feature vector for $v_i$ by taking the largest $D$ similarity scores as feature values, i.e.,

$$\theta(v_i) = \left(S_{RP}(v_i, v_{(1)}), S_{RP}(v_i, v_{(2)}), \cdots, S_{RP}(v_i, v_{(D)})\right),$$

where $S_{RP}(v_i, v_{(d)})$ denotes the $d$-th largest path similarity between $v_i$ and another vertex $v_{(d)}$.

Finally, the similarity between $v_i$ and $v_j$ is re-calculated as the reciprocal Euclidean distance between their feature vectors:

$$S_{RP++}(v_i, v_j) = \frac{1}{\|\theta(v_i) - \theta(v_j)\|}.$$

The idea of using vertex features to estimate vertex similarity was also used for graph mining [14].
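A minimal sketch of the feature construction and the reciprocal-distance similarity is given below. Zero-padding vectors for vertices with fewer than $D$ co-occurring vertices, and returning infinity for identical vectors, are assumptions made here to keep the example well defined; the function names and toy scores are also illustrative.

```python
import math


def feature_vector(similarities, D):
    """Top-D path similarities of one vertex, in descending order.

    `similarities` maps other vertices to their path similarity with this vertex.
    Vectors shorter than D are padded with zeros (an assumption of this sketch).
    """
    top = sorted(similarities.values(), reverse=True)[:D]
    return top + [0.0] * (D - len(top))


def reciprocal_distance_similarity(theta_i, theta_j):
    """Reciprocal Euclidean distance between two feature vectors."""
    dist = math.dist(theta_i, theta_j)
    # Identical vectors have zero distance; report infinite similarity (assumption).
    return math.inf if dist == 0.0 else 1.0 / dist


# Hypothetical path similarities for two vertices from different components.
theta_u = feature_vector({"b": 0.4, "c": 0.1, "d": 0.3}, D=2)  # [0.4, 0.3]
theta_v = feature_vector({"x": 0.4, "y": 0.3}, D=2)            # [0.4, 0.3]
same_role = reciprocal_distance_similarity(theta_u, theta_v)
```

Since the vectors contain only sorted similarity values and no vertex identities, two vertices can be compared even when no path connects them, which is what allows similarity search across disconnected networks.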

Index of Feature Vectors. Again, we use indexing techniques to improve the algorithm efficiency. We build a memory-based kd-tree index for the feature vectors of all vertices. Then, given a vertex, we can efficiently retrieve the top-k vertices in the kd-tree with the least Euclidean distance to the query vertex. At a high level, a kd-tree is a generalization of a binary search tree that stores points in $D$-dimensional space. At level $h$ of a kd-tree, given a node $v$, the ($h \% D$)-th element in the vector of each node in its left subtree is less than the ($h \% D$)-th element in the vector of $v$, while the ($h \% D$)-th element of every node in the right subtree is no less than the ($h \% D$)-th element of $v$. Figure 3 shows the data structure of the index built in Panther++. Based on the index, we can very quickly query whether a given point is stored in the index. Specifically, given a vertex $v$, if the root node is $v$, return the root node. If the first element of $v$ is strictly less than the first element of the root node, look for $v$ in the left subtree, where the comparison moves on to the second element; otherwise, check the right subtree. It is worth noting that we can easily replace the kd-tree with any other index method, such as an R-tree. The algorithms for calculating the feature vectors of all vertices and the similarity between vertices are shown in Algorithm 2.
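The index structure described above can be sketched in a few lines of pure Python. This is a simplified illustration, not the ANN toolkit used in the experiments: the tuple-based tree layout, the median split, and the function names are assumptions, and the search returns exact nearest neighbors with standard branch pruning.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a kd-tree from a list of (vector, label) pairs.

    Each node is a tuple (point, left_subtree, right_subtree); the splitting
    coordinate at a given depth is depth % D, as in the description above.
    """
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def _search(node, query, k, depth, heap):
    if node is None:
        return
    (vec, label), left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(vec, query))
    heapq.heappush(heap, (-d2, label))  # max-heap on distance via negation
    if len(heap) > k:
        heapq.heappop(heap)             # drop the current farthest candidate
    diff = query[depth % len(query)] - vec[depth % len(query)]
    near, far = (left, right) if diff < 0 else (right, left)
    _search(near, query, k, depth + 1, heap)
    # Visit the far side only if it could still contain a closer point.
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        _search(far, query, k, depth + 1, heap)


def query_top_k(tree, query, k):
    """Return the k nearest (squared_distance, label) pairs, closest first."""
    heap = []
    _search(tree, query, k, 0, heap)
    return sorted((-neg_d2, label) for neg_d2, label in heap)


# Hypothetical 2-dimensional feature vectors labeled by vertex name.
points = [([0.0, 0.0], "a"), ([1.0, 0.0], "b"), ([0.0, 2.0], "c"),
          ([5.0, 5.0], "d"), ([0.9, 0.1], "e")]
tree = build_kdtree(points)
nearest = [label for _, label in query_top_k(tree, [1.0, 0.0], k=2)]
```

In Panther++, the query vector would be the feature vector of the query vertex, and the query vertex itself would typically be excluded from the returned neighbors.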

Implementation Notes. In our experiments, we empirically set the parameters as follows: $c = 0.5$, $\delta = 0.1$, $T = 5$, and $D = 50$; the error-bound $\varepsilon$ is also set empirically. The optimal values of $T$, $D$, and $\varepsilon$ are discussed in Section 4. We build the kd-tree using the toolkit ANN.^4

Figure 3: Data structure of the index built in Panther++.

Table 1: Time and space complexity for calculating top-k similar vertices for all vertices in a network. $I$: number of iterations, $d$: average degree, $f$: feature number, $D$: vector dimension, and $T$: path length.

Method          Time Complexity                       Space Complexity
SimRank [18]    $O(I|V|^2 d^2)$                       $O(|V|^2)$
TopSim          $O(|V| T d^T)$                        $O(|V| + |E|)$
RWR             $O(I|V|^2 d)$                         $O(|V|^2)$
RoleSim [19]    $O(I|V|^2 d^2)$                       $O(|V|^2)$
ReFeX [14]      $O(|V| + I(f|E| + |V|f^2))$           $O(|V| + |E|f)$
Panther         $O(RTc + |V|dT)$                      $O(RT + |V|d)$
Panther++       $O(RTc + |V|dT + |V|c)$               $O(RT + |V|d + |V|D)$

3.4 Complexity Analysis 

In general, existing methods result in high complexities. For example, the time complexity of SimRank [18], TopSim, Random walk with restart (RWR) [28], RoleSim [19], and ReFeX [14] is $O(I|V|^2 d^2)$, $O(|V| T d^T)$, $O(I|V|^2 d)$, $O(I|V|^2 d^2)$, and $O(|V| + I(f|E| + |V|f^2))$, respectively. Table 1 summarizes the time and space complexities of the different methods. For Panther, its time complexity includes two parts:

• Random path sampling: The time complexity of generating random paths is $O(RT \log d)$, where $\log d$ is very small and can be simplified as a small constant $c$. Hence, the time complexity is $O(RTc)$.

• Top-k similarity search: The time complexity of calculating top-k similar vertices for all vertices is $O(|V|\bar{R}T + |V|\bar{M})$. The first part, $O(|V|\bar{R}T)$, is the time complexity of calculating Eq. (4) for all pairs of vertices, where $\bar{R}$ is the average number of paths that contain a vertex and is proportional to the average degree $d$. The second part, $O(|V|\bar{M})$, is the time complexity of searching top-k similar vertices based on a heap structure, where $\bar{M}$ represents the average number of vertices that co-occur with a vertex and is proportional to $d$. Hence, the time complexity is $O(|V|dT)$.

The space complexity for storing paths and the vertex-to-path index is $O(RT)$ and $O(|V|d)$, respectively.
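The heap-based selection step in the top-k search can be illustrated with Python's standard `heapq` module. The function name and the toy scores are hypothetical; `heapq.nlargest` keeps only a small heap of the best candidates seen so far, so the cost grows slowly with $k$ when $k$ is much smaller than the number of co-occurring vertices.

```python
import heapq


def top_k_similar(scores, k):
    """Return the k (vertex, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical path similarities between a query vertex and its co-occurring vertices.
scores = {"b": 0.4, "c": 0.1, "d": 0.3, "e": 0.2}
best_two = top_k_similar(scores, 2)  # [("b", 0.4), ("d", 0.3)]
```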

Panther++ requires additional computation to build the kd-tree. The time complexity of building a kd-tree is $O(|V| \log |V|)$ and querying top-k similar vertices for any vertex is $O(|V| \log |V|)$, where $\log |V|$ is small and can be viewed as a small constant $c$. Additional space (with a complexity of $O(|V|D)$) is required to store $|V|$ vectors of dimension $D$.

^4 http://www.cs.umd.edu/~mount/ANN/


































4. EXPERIMENTS 
4.1 Experimental Setup 

In this section, we conduct various experiments to evaluate the proposed methods for top-k similarity search.

Datasets. We evaluate the proposed method on four different networks: Tencent, Twitter, Mobile, and Co-author.

Tencent: The dataset is from Tencent Weibo,^5 a popular Twitter-like microblogging service in China, and consists of over 355,591,065 users and 5,958,853,072 "following" relationships. The weight associated with each edge is set as 1.0 uniformly. This is the largest network in our experiments. We mainly use it to evaluate the efficiency performance of our methods.

Twitter [16]: The dataset was crawled in the following way. We first selected the most popular user on Twitter, i.e., "Lady Gaga", and randomly selected 10,000 of her followers. We then collected all followers of these users. In the end, we obtained 113,044 users and 468,238 "following" relationships in total. The weight associated with each edge is also set as 1.0 uniformly. We use this dataset to evaluate the accuracy of Panther and Panther++.

Mobile: The dataset is from a mobile communication company, and consists of millions of call records. Each call record contains information about the sender, the receiver, the starting time, and the ending time. We build a network using call records within two weeks by treating each user as a vertex, and communication between users as an edge. The resultant network consists of 194,526 vertices and 206,934 edges. The weight associated with each edge is defined as the number of calls. We also use this dataset to evaluate the accuracy of the proposed methods.

Co-author [34]: The dataset^6 is from AMiner.org, and contains 2,092,356 papers. From the original citation data, we extracted a weighted co-author graph from each of the following conferences from 2005 to 2013: KDD, ICDM, SIGIR, CIKM, SIGMOD, ICDE, and ICML.^7 The weight associated with each edge is the number of papers collaborated on by the two connected authors. We also use the dataset to evaluate the accuracy of the proposed methods.

Evaluation Aspects. To quantitatively evaluate the proposed methods, we consider the following performance measurements:

Efficiency Performance: We apply our methods to the Tencent network to evaluate the computational time.

Accuracy Performance: We apply the proposed methods to recognize identical authors on different co-author networks. We also apply our methods to the Co-author, Twitter, and Mobile networks to evaluate how they estimate the top-k similarity search results.

Parameter Sensitivity Analysis: We analyze the sensitivity of different parameters in our methods: path length $T$, vector dimension $D$, and error-bound $\varepsilon$.

Finally, we also use several case studies as anecdotal evidence to further demonstrate the effectiveness of the proposed method. All code is implemented in C++ and compiled using GCC 4.8.2 with the -O3 flag. The experiments were conducted on an Ubuntu server with four Intel Xeon(R) CPU E5-4650 (2.70GHz) and 1T RAM.

^5 http://t.qq.com

^6 http://aminer.org/citation

^7 Numbers of vertices/edges of different conferences are: KDD: 2,867/7,637, ICDM: 2,607/4,774, SIGIR: 2,851/6,354, CIKM: 3,548/7,076, SIGMOD: 2,616/8,304, ICDE: 2,559/6,668, and ICML: 3,511/6,105.

Comparison methods. We compare with the following methods:

RWR [28]: Starts from $v_i$ and iteratively walks to its neighbors with probability proportional to their edge weights. At each step, it also has some probability of walking back to $v_i$ (set as 0.1). The similarity between $v_i$ and $v_j$ is defined as the steady-state probability that a walk from $v_i$ will finally reach $v_j$. We calculate RWR scores between all pairs and then search the top-k similar vertices for each vertex.
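For reference, the RWR baseline described above can be sketched as a simple power iteration. This is an illustrative implementation under stated assumptions, not the code used in the experiments: the nested-dictionary graph, the function name, the fixed iteration count, and sending the mass of neighborless vertices back to the source are all choices made here.

```python
def rwr_scores(graph, source, restart=0.1, iterations=200):
    """Approximate steady-state RWR probabilities from `source` by power iteration."""
    nodes = list(graph)
    p = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for v, mass in p.items():
            total = sum(graph[v].values())
            if total == 0:
                # Assumption: a vertex with no neighbors returns its mass to the source.
                nxt[source] += (1 - restart) * mass
                continue
            for u, w in graph[v].items():
                # Walk to a neighbor with probability proportional to edge weight.
                nxt[u] += (1 - restart) * mass * w / total
        nxt[source] += restart  # restart mass jumps back to the source
        p = nxt
    return p


# Hypothetical path graph a - b - c plus an isolated vertex d.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}, "d": {}}
scores = rwr_scores(graph, "a")
```

Running this once per source vertex yields all-pairs scores, which is the step that makes the baseline expensive on large networks.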

TopSim: Extends SimRank [18] on one graph $G$ to efficiently finding top-k authoritative vertices on the product graph $G \times G$.

RoleSim [19]: Refines SimRank [18] by changing the average similarity of all neighbor pairs to all matched neighbor pairs. We calculate RoleSim scores between all pairs and then search the top-k similar vertices for each vertex.

ReFeX [14]: Defines local, egonet, and recursive features to capture the structural characteristics. The local feature is the vertex degree. Egonet features include the number of within-egonet edges and the number of out-egonet edges. For weighted networks, they contain weighted versions of each feature. Recursive features are defined as the mean and sum value of each local or egonet feature among all neighbors of a vertex. In our experiments, we only extract recursive features once and construct a vector for each vertex from a total of 18 features. For fair comparison, to search top-k similar vertices, we also build the same kd-tree as that in our method.

The codes of TopSim, RoleSim, and ReFeX are provided by the authors of the original papers. We tried to use the fast versions of TopSim and RoleSim mentioned in their papers.

4.2 Efficiency and Scalability Performance 

In this subsection, we first fix k = 5 and evaluate the efficiency
and scalability of the comparison methods using the Tencent dataset.
We evaluate the performance on randomly extracted sub-networks of
different sizes (large and small) of the Tencent network. For TopSim
and RoleSim, we only show the computational time for similarity
search. For ReFeX, Panther, and Panther++, we also show the
computational time used for preprocessing.

Table 2 lists statistics of the different Tencent sub-networks and
the efficiency performance of the comparison methods. Clearly, our
methods (both Panther and Panther++) are much faster than the
comparison methods. For example, on the Tencent6 sub-network, which
consists of 443,070 vertices and 5,000,000 edges, Panther achieves a
390× speed-up compared to the fastest (ReFeX) of all the comparison
methods.
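
As a quick check of the speed-up figure, the arithmetic can be reproduced from the Tencent6 row of Table 2. The sketch below assumes that each entry reads as preprocessing time plus search time (24.15hr + 8.55s for ReFeX and 50.91s + 2.82m for Panther); the helper `to_seconds` is a hypothetical name.

```python
UNIT_SECONDS = {"s": 1.0, "m": 60.0, "hr": 3600.0}


def to_seconds(value, unit):
    """Convert a time given in seconds, minutes, or hours to seconds."""
    return value * UNIT_SECONDS[unit]


# Tencent6 row: preprocessing time + top-k search time.
refex = to_seconds(24.15, "hr") + to_seconds(8.55, "s")
panther = to_seconds(50.91, "s") + to_seconds(2.82, "m")

speed_up = refex / panther
print(f"ReFeX: {refex:.2f} s, Panther: {panther:.2f} s, speed-up: {speed_up:.0f}x")
```

Under this reading the ratio comes out slightly above 390, in line with the speed-up reported in the text.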

Figure 4(a) shows the speed-up of Panther++ compared to ReFeX on
sub-networks of different scales. The speed-up is moderate when the
network is small (|E| < 1,000,000); as the size of the network
continues to increase, the obtained speed-up is even superlinear. We
also compared the results of ReFeX and Panther++. The results of
Panther++ are very similar to those of ReFeX, though they decrease
slightly when the network is small. Figure 4(b) shows the efficiency
performance of Panther and Panther++ when varying the value of k from
5 to 100. We can see that the time costs of Panther and Panther++ are
not very sensitive to k; the time cost grows slowly as k gets larger.
This is because k only affects the time complexity of the top-k
similarity search, which is based on a heap structure. As k gets
larger, the time complexity approaches O(M log M) from O(M), where M
is the average number of vertices co-occurring on the same paths. We
can also see that the time cost is not very stable when k gets larger,
because the paths are randomly generated, which results in different
values of M each time.
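
The dependence on k can be seen in a minimal heap-based top-k selection. This is a generic sketch, not the authors' code: the function name `top_k_similar` and the input format (a dictionary of estimated similarities for the M co-occurring vertices) are assumptions. Maintaining a heap of at most k entries costs O(M log k), which is close to O(M) for small k and approaches O(M log M) as k grows toward M.

```python
import heapq


def top_k_similar(co_occurrence_scores, k):
    """co_occurrence_scores: {vertex: estimated similarity to the query}."""
    heap = []  # min-heap of (score, vertex) holding at most k entries
    for vertex, score in co_occurrence_scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, vertex))
        elif score > heap[0][0]:
            # Replace the current weakest candidate.
            heapq.heapreplace(heap, (score, vertex))
    # Return vertices from most to least similar.
    return [vertex for score, vertex in sorted(heap, reverse=True)]


# Hypothetical similarity estimates for one query vertex.
estimates = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.3}
best_two = top_k_similar(estimates, 2)
```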

From Table 2, we can also see that RWR, TopSim, and RoleSim cannot
complete top-k similarity search for all vertices within a reasonable
time when the number of edges increases to 500,000. ReFeX can deal
with larger networks, but also fails when the number of edges
increases to 10,000,000. Our methods can scale up to handle very large
networks with more than 10,000,000 edges. On average, Panther only
needs 0.0001 second to perform top-k similarity search for each vertex
in a large network.

Table 2: Efficiency performance (CPU time) of comparison methods on different sizes of the Tencent sub-networks. The time
includes all computational cost for processing and top-k similarity search for all vertices. The time before "+" denotes the time used
for processing and the time after "+" denotes that used for top-k similarity search. "-" indicates that the corresponding algorithm
cannot finish the computation within a reasonable time.

Sub-network  |V|         |E|            RWR      TopSim    RoleSim  ReFeX          Panther          Panther++
Tencent1     6,523       10,000         +7.79hr  +28.58m   +37.26s  3.85s+0.07s    0.07s+0.26s      0.99s+0.21s
Tencent2     25,844      50,000         +>150hr  +11.20hr  +12.98m  26.09s+0.40s   0.28s+1.53s      2.45s+4.21s
Tencent3     48,837      100,000        -        +30.94hr  +1.06hr  2.02m+0.57s    0.58s+3.48s      5.30s+5.96s
Tencent4     169,209     500,000        -        +>120hr   +>72hr   17.18m+2.51s   8.19s+16.08s     27.94s+24.17s
Tencent5     230,103     1,000,000      -        -         -        31.50m+3.29s   15.31s+30.63s    49.83s+22.86s
Tencent6     443,070     5,000,000      -        -         -        24.15hr+8.55s  50.91s+2.82m     4.01m+1.29m
Tencent7     702,049     10,000,000     -        -         -        >48hr          2.21m+6.24m      8.60m+6.58m
Tencent8     2,767,344   50,000,000     -        -         -        -              15.78m+1.36hr    1.60hr+2.17hr
Tencent9     5,355,507   100,000,000    -        -         -        -              44.09m+4.50hr    5.61hr+6.47hr
Tencent10    26,033,969  500,000,000    -        -         -        -              4.82hr+25.01hr   32.90hr+47.34hr
Tencent11    51,640,620  1,000,000,000  -        -         -        -              13.32hr+80.38hr  98.15hr+120.01hr

(a) Speed-up (b) Effect of k

Figure 4: (a) Performance ratio is calculated by
score(Panther++) / score(ReFeX), where score is evaluated by the
application of structural hole spanner finding (see § 4.3 for
details); speed-up is calculated by Time(ReFeX) / Time(Panther++).
(b) Effect of k on the efficiency performance of Panther and
Panther++.

4.3 Accuracy Performance with Applications 

Identity Resolution. It is difficult to find a ground truth to
evaluate the accuracy of similarity search. To quantitatively evaluate
the accuracy of the proposed methods and compare them with the other
methods, we consider an application of identity resolution on the
co-author network. The idea is that we first use the authorship at
different conferences to generate multiple co-author networks. An
author may have a corresponding vertex in each of the generated
networks. We assume that the same authors in different networks
of the same domain are similar to each other. We anonymize author
names in all the networks. Thus, given any two co-author networks,
for example KDD-ICDM, we perform a top-k search to find similar
vertices from ICDM for each vertex in KDD by different methods.
If the k similar vertices returned from ICDM by a method include
the author corresponding to the query vertex from KDD, we say
that the method hits a correct instance. A similar idea was also
employed to evaluate similarity search in (TT). Please note that
the search is performed across two disconnected networks. Thus,
RWR, TopSim, and RoleSim cannot be directly used for solving the
task. ReFeX calculates a vector for each vertex and can be used
here. Additionally, we compare with several other methods,
including Degree, Clustering Coefficient, Closeness, Betweenness,
and PageRank. Among our methods, Panther is not applicable to this
situation, so we only evaluate Panther++ here. We also show the
performance of random guess.
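
The hit criterion described above can be written down compactly. The following is an illustrative sketch only: the function name `hit_rate`, the vertex identifiers, and the choice to report the fraction of query vertices with a hit are assumptions, not details taken from the experiments.

```python
def hit_rate(top_k_results, true_match):
    """Fraction of query vertices whose true counterpart is among the returned k.

    top_k_results: {query_vertex: list of k candidate vertices in the other network}
    true_match:    {query_vertex: vertex of the same author in the other network}
    """
    hits = sum(
        1
        for query, candidates in top_k_results.items()
        if true_match.get(query) in candidates
    )
    return hits / len(top_k_results)


# Hypothetical KDD queries searched against ICDM with k = 2.
results = {"kdd_1": ["icdm_7", "icdm_1"], "kdd_2": ["icdm_3", "icdm_9"]}
truth = {"kdd_1": "icdm_1", "kdd_2": "icdm_2"}
rate = hit_rate(results, truth)
```

Here the first query hits its counterpart and the second does not, so half of the queries count as correct instances.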

Figure 5 presents the performance of Panther++ on the task of
identity resolution across co-author networks. We see that Panther++
performs the best on all three datasets. ReFeX performs comparably
well; however, it is not very stable. In the SIGMOD-ICDE case, it
performs the same as Panther++, while in the KDD-ICDM and SIGIR-CIKM
cases, it performs worse than Panther++ when k < 60.

Approximating Common Neighbors. We evaluate how Panther 
can approximate the similarity based on common neighbors. The 
evaluation procedure is described as follows: 

1. For each vertex u in the seed set S, generate the top-k vertices
   $\mathrm{Top}_{A,k}(u)$ most similar to u by the algorithm A.

2. For each vertex $v \in \mathrm{Top}_{A,k}(u)$, calculate g(u, v),
   where g is a coarse similarity measure defined as the ground truth.
   Define $f_{A,k} = \sum_{u \in S} \sum_{v \in \mathrm{Top}_{A,k}(u)} g(u, v)$.

3. Similarly, let $f_{R,k}$ denote the result of a random algorithm.

4. Finally, we define the score for algorithm A as
   $\mathrm{score}(A, k) = f_{A,k} / f_{R,k}$, which represents the
   improvement of algorithm A over a random-based method.

Specifically, we define g(u,v) to be the number of common 
neighbors between u and v on each dataset. 
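
The evaluation procedure can be sketched in a few lines, with g taken to be the number of common neighbors. Two points are assumptions of this sketch rather than statements from the paper: the score is taken to be the ratio of the algorithm's total to the random baseline's total, and the random baseline's top-k lists are passed in explicitly. All function names are hypothetical.

```python
def common_neighbors(adj, u, v):
    """g(u, v): number of neighbors shared by u and v (adj maps vertex -> set)."""
    return len(adj[u] & adj[v])


def f_value(adj, seeds, top_k):
    """Sum of g(u, v) over every seed u and every v in its top-k list."""
    return sum(common_neighbors(adj, u, v) for u in seeds for v in top_k[u])


def score(adj, seeds, top_k_algo, top_k_random):
    """Improvement of an algorithm over a random baseline (assumed to be a ratio)."""
    return f_value(adj, seeds, top_k_algo) / f_value(adj, seeds, top_k_random)


# Toy graph (hypothetical): vertices 1 and 2 share neighbors 3 and 4.
toy = {1: {3, 4}, 2: {3, 4}, 3: {1, 2, 5}, 4: {1, 2}, 5: {3}}
improvement = score(toy, seeds=[1], top_k_algo={1: [2]}, top_k_random={1: [5]})
```

In this toy case the algorithm returns vertex 2, which shares two neighbors with the seed, while the random pick shares one, so the ratio is 2. A baseline total of zero would need special handling in practice.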

Figure 6 shows the performance of Panther evaluated on the ground
truth of common neighbors in different networks. Some baselines, such
as RWR and RoleSim, are omitted on some datasets because they cannot
complete top-k similarity search for all vertices within a reasonable
time. It can be seen that Panther performs better than the other
methods on most datasets. Panther++, ReFeX, and RoleSim perform worst,
since they are not designed to address the similarity between near
vertices. Our method Panther performs as well as TopSim, the top-k
version of SimRank, because both are based on the principle that two
vertices are considered structurally equivalent if they have many
common neighbors in a network. However, according to our previous
analysis, TopSim is much slower than Panther.

Top-k Structural Hole Spanner Finding. The other application
we consider in this work is top-k structural hole spanner finding.

































(a) KDD-ICDM (b) SIGIR-CIKM (c) SIGMOD-ICDE

Figure 5: Performance of identity resolution across two networks with different comparison methods.

(c) Twitter (d) Mobile

Figure 6: Performance of Panther evaluated on the ground truth of common neighbors.

Figure 7: Performance of mining structural hole spanners on the Twitter and Mobile networks with different methods.


The theory of structural holes Q suggests that, in social networks, 
individuals would benefit from filling the “holes” between people 
or groups that are otherwise disconnected. The problem of finding 
top-k structural hole spanners was proposed in prior work, which also
shows that 1% of users who span structural holes control 25% of 
the information diffusion (retweeting) in Twitter. 

Structural hole spanners are not necessarily connected, but they
share the same structural patterns, such as local clustering
coefficient and centrality. Thus, the idea here is to feed a few seed
users to the proposed Panther++ and use it to find other structural
hole spanners. For evaluation, we use network constraint p) to obtain
the structural hole spanners in Twitter and Mobile, and use these as
the ground truth. Then we apply different methods (Panther++, ReFeX,
Panther, and SimRank) to retrieve the top-k similar users for each
structural hole spanner. If an algorithm can find another structural
hole spanner in the top-k returned results, then it makes a correct
search. We define g(u, v) = 1 if both u and v are structural hole
spanners, and g(u, v) = 0 otherwise.
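
For readers unfamiliar with network constraint, the sketch below computes one common formulation of Burt's measure on an unweighted, undirected graph, where a lower value suggests that a vertex spans more structural holes. Which exact variant and threshold were used to build the ground truth is not stated here, so the function `network_constraint` and the toy graph are purely illustrative.

```python
def network_constraint(adj, i):
    """Burt's constraint of vertex i; adj maps vertex -> set of neighbors."""

    def p(a, b):
        # Proportion of a's ties invested in b (uniform for unweighted graphs).
        return 1.0 / len(adj[a]) if b in adj[a] else 0.0

    total = 0.0
    for j in adj[i]:
        # Indirect investment in j through other contacts q of i.
        indirect = sum(p(i, q) * p(q, j) for q in adj[i] if q != j)
        total += (p(i, j) + indirect) ** 2
    return total


# Star graph (hypothetical): the center bridges three otherwise disconnected leaves.
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
center = network_constraint(star, "c")
leaf = network_constraint(star, "x")
```

The center of the star gets a much lower constraint than any leaf, which matches the intuition that it fills the holes between the leaves.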

Figure 7 shows the performance of the comparison methods for
finding structural hole spanners in different networks. Panther++
achieves consistently better performance than the comparison methods
as the value of k varies. TopSim, the top-k version of SimRank, seems
inapplicable to this task. This is reasonable, as the underlying
principle of SimRank is to find vertices with more connections to the
query vertex.

4.4 Parameter Sensitivity Analysis 

We now discuss how different parameters influence the performance
of our methods.

Effect of Path Length T. Figure 8 shows the accuracy performance
of Panther++ for mining structural hole spanners when varying the path
length T over 2, 5, 10, 20, 50, and 100. Too small a T (< 5) results
in inferior performance. On Twitter, the performance becomes almost
stable once T reaches 5. On Mobile, the situation is a bit more
complex, but in general T = 5 seems to be a good choice.

Effect of Vector Dimension D. Figure 9 shows the accuracy
performance of Panther++ for mining structural hole spanners when
varying the vector dimension D over 2, 5, 10, 20, 50, and 100.
Generally speaking, the performance improves as D increases and
remains the same once D exceeds 50. This is reasonable, as Panther
estimates the distribution of a vertex linking to the other vertices.
Thus, the higher the vector dimension, the better the approximation.
Once the dimension exceeds a threshold, the performance becomes stable.






























































































Figure 8: Effect of path length T on the accuracy performance 
of Panther++. 




Figure 9: Effect of vector dimension D on the accuracy performance
of Panther++.


Effect of Error Bound ε. Figure 10 shows the accuracy performance
of Panther and Panther++ on Tencent networks of different scales when
varying the error bound ε from 0.06 to 0.0001. We evaluate how well
Panther can estimate the similarity based on common neighbors.
Specifically, we use the same evaluation method as in structural hole
spanner finding and define g(u, v) to be the number of common
neighbors between u and v on each dataset. We see that when the ratio
between |E| and $(1/\varepsilon)^2$ ranges from 5 to 20, the scores of
Panther almost converge on all the datasets, and when the ratio
ranges from 0.2 to 5, the scores of Panther++ almost converge on all
the datasets. Thus we can conclude that the value of
$(1/\varepsilon)^2$ is almost linearly and positively correlated with
the number of edges in a network. Therefore, we can empirically
estimate $\varepsilon = \sqrt{1/|E|}$ in our experiments.
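
To make the empirical rule concrete, the following sketch evaluates the square root of 1/|E| for three of the Tencent sub-network sizes listed in Table 2. The function name `empirical_error_bound` is a hypothetical helper, and the edge counts are copied from that table.

```python
import math


def empirical_error_bound(num_edges):
    """Empirical choice of the error bound: sqrt(1 / |E|)."""
    return math.sqrt(1.0 / num_edges)


# Edge counts of three Tencent sub-networks from Table 2.
for name, edges in [
    ("Tencent1", 10_000),
    ("Tencent6", 5_000_000),
    ("Tencent11", 1_000_000_000),
]:
    eps = empirical_error_bound(edges)
    print(f"{name}: |E| = {edges}, eps = {eps:.6f}")
```

With this choice, the quantity (1/ε)^2 equals |E| by construction, so the error bound tightens automatically as the network grows.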

4.5 Qualitative Case Study

Now we present two case studies to demonstrate the effectiveness
of the proposed methods.

“Similar Researchers.” Table 3 shows an example of the top-5 authors
most similar to Jiawei Han, Michael I. Jordan, and W. Bruce Croft, as
found by Panther and Panther++. The two methods present very
different results. The authors found by Panther have closer
connections with the query author, while those found by Panther++
have a similar “social status” (essentially similar structural
patterns) to the query author. For example, Philip S. Yu and Christos
Faloutsos are two researchers as famous as Jiawei Han in the data
mining field (KDD). Andrew Y. Ng and Bernhard Scholkopf are
influential researchers similar to Michael I. Jordan in the machine
learning field (ICML).

(b) Panther++

Figure 10: Effect of error-bound ε on the performance of Panther
and Panther++ on different sizes of Tencent networks.

Figure 11: Case study in a scientific co-author network [27].
The authors in similar positions to that of Barabasi are denoted
in green, similar to that of Robert are in red, and similar to that
of Rinzel are in blue. Others are in yellow.

“Who is similar to Barabasi?” Albert-Laszlo Barabasi is a famous
Hungarian-American physicist who proposed the Barabasi-Albert (BA)
model for generating random scale-free networks using a preferential
attachment mechanism. We apply Panther++ to a scientific network
[T^ |27) to find researchers who have similar structural positions to
that of Dr. Barabasi. It is interesting that different researchers
play different roles in the network. Mark Newman and Vito Latora have
similar structural patterns to that of Dr. Barabasi. Some other
researchers, like Robert, form a tight-knit group with him. Panther++
successfully recognizes those researchers with similar structural
positions.

5. RELATED WORK

Table 3: Case study of top-5 similar authors in KDD, ICML and SIGIR networks.

Jiawei Han                        Michael I. Jordan                           W. Bruce Croft
Panther       Panther++           Panther               Panther++             Panther            Panther++
Chi Wang      Philip S. Yu        Eric P. Xing          Andrew Y. Ng          Michael Bendersky  Leif Azzopardi
Jing Gao      Christos Faloutsos  Percy Liang           Bernhard Scholkopf    Trevor Strohman    Maarten de Rijke
Xifeng Yan    Jieping Ye          Lester W. Mackey      Zoubin Ghahramani     Jangwon Seo        Zheng Chen
Yizhou Sun    Naren Ramakrishnan  Gert R. G. Lanckriet  Michael L. Littman    Donald Metzler     Ryen W. White
Philip S. Yu  Ravi Kumar          Purnamrita Sarkar     Thomas G. Dietterich  Jiwoon Jeon        Chengxiang Zhai

Early similarity measures, including bibliographical coupling and
co-citation, are based on the assumption that two vertices are similar
if they have many common neighbors. This category of methods cannot
estimate similarity between vertices without common neighbors. Several
measures have been proposed to address this problem. For example,
Katz pO) counts two vertices as similar if there are more and shorter
paths between them. Tsourakakis et al. 1^ learn a low-dimensional
vector for each vertex from the adjacency matrix and calculate
similarities between the vectors. Jeh and Widom fTS) propose a new
algorithm, SimRank. The algorithm follows a basic recursive intuition
that two nodes are similar if they are referenced by similar nodes.
VertexSim (13 is an extension of SimRank. However, all the
SimRank-based methods share a common drawback: their computational
complexities are too high. Further studies have been done to reduce
the computational complexity of SimRank |[T2]J22]J^. Fast
random-walk-based graph similarity, such as in |10||31| 7, has also
been studied recently. Sun et al. measure similarities between
vertices based on their inter-paths instantiated from different
schemes defined in a heterogeneous information network. The setting is
different from ours and the algorithm is not efficient.
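
The recursive intuition behind SimRank, and the reason its cost grows quickly, can be seen in a naive iteration. This is a textbook-style sketch rather than any of the optimized variants cited above; the decay factor of 0.8, the iteration count, and the function name `simrank` are assumptions.

```python
def simrank(in_neighbors, decay=0.8, iters=10):
    """Naive SimRank. in_neighbors: {vertex: list of vertices that reference it}."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[(a, b)] = 0.0
                    continue
                # Average similarity over all pairs of referencing vertices.
                total = sum(sim[(x, y)] for x in ia for y in ib)
                new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical graph: u references both a and b.
toy = {"u": [], "a": ["u"], "b": ["u"]}
result = simrank(toy)
```

Every iteration visits all vertex pairs and, for each pair, all pairs of their in-neighbors, which illustrates why the all-pairs computation does not scale to large networks.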

Most of the aforementioned methods cannot handle similarity
estimation across different networks. Blondel et al. provide a
HITS-based recursive method to measure similarity between vertices
across two different graphs. RoleSim (T^ can also calculate the
similarity between disconnected vertices. As with SimRank, the
computational complexity of the two methods is very high.
Feature-based methods can match vertices with similar structures.
For example, Burt [4] counts the 36 kinds of triangles in one’s ego
network to represent a vertex’s structural characteristic. In the same
way, vertex centrality, closeness centrality, and betweenness
centrality |[^ of two different vertices can be compared to produce
a structural similarity measure. Aoyama et al. Q present a fast
method to estimate similarity search between objects, instead of
vertices in networks. ReFeX fT?) [TS) defines basic features such
as degree and the number of within/out-egonet edges, and defines the
aggregated values of these features over neighbors as recursive
features. The computational complexity of ReFeX depends on the number
of recursions. More references about feature-based similarity
search in networks can be found in the survey [30].

6. CONCLUSION 

In this paper, we propose a sampling method to quickly estimate
top-k similarity search in large networks. The algorithm is based
on the idea of random path, and an extended method is also presented
to enhance the structural similarity when two vertices are
completely disconnected. We provide theoretical proofs for the
error-bound and confidence of the proposed algorithm. We perform an
extensive empirical study and show that our algorithm can obtain
top-k similar vertices for any vertex in a network approximately
300× faster than state-of-the-art methods. We also use identity
resolution and structural hole spanner finding, two important
applications in social networks, to evaluate the accuracy of the
estimated similarities. Our experimental results demonstrate that
the proposed algorithm achieves clearly better performance than
several alternative methods.